Many data manipulation and cleaning functions
Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.
You will be required to import pandas as ‘pd’ and then use ‘pd’ object to perform other basic pandas operations.
import pandas as pd
import numpy as np
print('pandas', pd.__version__)
print('numpy', np.__version__)
pandas 1.1.5 numpy 1.19.5
Some more information here
Typically used in intermediate stage data processing and conversion to pandas data frame
#Create a Dictionary of series
data = {'Name':['Wood', 'Winston', 'Williams', 'White', 'Lorenzen'],
'Weight':[255,231,195,185,285],
'No':[49,2,43,14, 13],
'Pos':['LS', 'QB', 'FS', 'WR', 'QB']}
df1 = pd.DataFrame(data)
print(df1)
Name Weight No Pos 0 Wood 255 49 LS 1 Winston 231 2 QB 2 Williams 195 43 FS 3 White 185 14 WR 4 Lorenzen 285 13 QB
Gitgub Raw
raw
fileraw.githubusercontent.com
domain is used to serve unprocessed versions of files stored in GitHub repositories. GitHub
and then click theRaw link
to get the raw
file# import pandas as pd
download_url = "https://raw.githubusercontent.com/sdhar-pycourse/py-datascience-biz/main/practise-datasets/bivar.csv"
df2 = pd.read_csv(download_url)
for df in [df1, df2]:
print('Object Type:', type(df), df.size, 'in kb')
# /1024*1024
Object Type: <class 'pandas.core.frame.DataFrame'> 20 in kb Object Type: <class 'pandas.core.frame.DataFrame'> 546 in kb
from google.colab import files
df2.to_csv('filename.csv')
files.download('filename.csv')
.set_options()
Pandas supports environmental setting for easy browsing of data
#configure Pandas to display all columns
pd.set_option("display.max.columns", None)
print('Max Num of Cols:', pd.options.display.max_columns)
# Change precision to 2.
pd.set_option('display.precision', 2)
print('float precision:', pd.options.display.precision)
Max Num of Cols: None float precision: 2
pd.DataFrames
are the large scale data structures in Pandaspd.Series
type(df2.x4)
pandas.core.series.Series
print('Decompising main Pandas Data Structure:')
print()
print('Data Structure: ', type(df2))
print('Series: ', type(df2['x2']), type(df2.x4))
print('Index: ', type(df2.index))
print('Underlying NumPy array in Series:', type(df2.x2.values))
print('Underlying NumPy array in Index: ', type(df2.index.values))
Decompising main Pandas Data Structure: Data Structure: <class 'pandas.core.frame.DataFrame'> Series: <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'> Index: <class 'pandas.core.indexes.range.RangeIndex'> Underlying NumPy array in Series: <class 'numpy.ndarray'> Underlying NumPy array in Index: <class 'numpy.ndarray'>
.shape()
et alPandas .size
, .shape
and .ndim
are used to return size, shape and dimensions of data frames and series
print(df2.shape)
(78, 7)
df2.size
546
df2.ndim
2
.head()
Pandas head()
method is used to return top n
(5 by default) rows of a data frame or series.
print(df2.head()) # if no value given it will print first 5 rows
idx x3 x2 x1 y2 y1 x4 0 1 1 4 3 1 1 4 1 2 0 2 3 0 1 3 2 3 0 1 3 1 0 2 3 4 1 1 3 1 1 2 4 5 0 1 3 1 1 2
print(df2.head(10))
idx x3 x2 x1 y2 y1 x4 0 1 1 4 3 1 1 4 1 2 0 2 3 0 1 3 2 3 0 1 3 1 0 2 3 4 1 1 3 1 1 2 4 5 0 1 3 1 1 2 5 6 0 1 3 0 1 2 6 7 1 2 2 0 1 2 7 8 0 1 3 0 0 2 8 9 0 2 1 0 0 1 9 10 1 2 3 1 1 2
.tail()
Pandas tail()
method is used to return bottom n
(5 by default) rows of a data frame or series.
print(df.tail())
idx x3 x2 x1 y2 y1 x4 73 74 0 2 1 0 0 1 74 75 0 1 3 0 1 2 75 76 0 4 3 1 0 2 76 77 0 1 2 0 0 1 77 78 1 3 1 1 1 2
df.tail(10)
idx | x3 | x2 | x1 | y2 | y1 | x4 | |
---|---|---|---|---|---|---|---|
68 | 69 | 0 | 1 | 2 | 1 | 0 | 2 |
69 | 70 | 0 | 3 | 1 | 1 | 1 | 2 |
70 | 71 | 0 | 2 | 3 | 0 | 1 | 3 |
71 | 72 | 0 | 2 | 2 | 0 | 0 | 1 |
72 | 73 | 1 | 3 | 1 | 1 | 1 | 2 |
73 | 74 | 0 | 2 | 1 | 0 | 0 | 1 |
74 | 75 | 0 | 1 | 3 | 0 | 1 | 2 |
75 | 76 | 0 | 4 | 3 | 1 | 0 | 2 |
76 | 77 | 0 | 1 | 2 | 0 | 0 | 1 |
77 | 78 | 1 | 3 | 1 | 1 | 1 | 2 |
.describe()
Generate descriptive statistics for numerical variables
Descriptive statistics include those that summarize the mean, median and quartiles
print(df2.describe())
print()
print(type(df2.describe()))
idx x3 x2 x1 y2 y1 x4 count 78.00 78.00 78.00 78.00 78.00 78.00 78.00 mean 39.50 0.10 1.81 2.14 0.37 0.54 1.81 std 22.66 0.31 0.97 0.75 0.49 0.50 0.79 min 1.00 0.00 1.00 1.00 0.00 0.00 1.00 25% 20.25 0.00 1.00 2.00 0.00 0.00 1.00 50% 39.50 0.00 1.00 2.00 0.00 1.00 2.00 75% 58.75 0.00 2.75 3.00 1.00 1.00 2.00 max 78.00 1.00 4.00 3.00 1.00 1.00 4.00 <class 'pandas.core.frame.DataFrame'>
df1.describe(include= 'all')
Name | Weight | No | Pos | |
---|---|---|---|---|
count | 5 | 5.00 | 5.00 | 5 |
unique | 5 | NaN | NaN | 4 |
top | Lorenzen | NaN | NaN | QB |
freq | 1 | NaN | NaN | 2 |
mean | NaN | 230.20 | 24.20 | NaN |
std | NaN | 41.54 | 20.56 | NaN |
min | NaN | 185.00 | 2.00 | NaN |
25% | NaN | 195.00 | 13.00 | NaN |
50% | NaN | 231.00 | 14.00 | NaN |
75% | NaN | 255.00 | 43.00 | NaN |
max | NaN | 285.00 | 49.00 | NaN |
.value_counts()
Return a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
df2.y2.value_counts()
0 49 1 29 Name: y2, dtype: int64
df1.Pos.value_counts(sort=False)
QB 2 LS 1 WR 1 FS 1 Name: Pos, dtype: int64
print(type(df1.Pos.value_counts(sort=False)))
print(df1.Pos.value_counts(sort=False).index.values)
<class 'pandas.core.series.Series'> ['QB' 'LS' 'WR' 'FS']
.info()
Prints a concise summary of a DataFrame.
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None, null_counts=None) [link text](https://)
df2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 78 entries, 0 to 77 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 idx 78 non-null int64 1 x3 78 non-null int64 2 x2 78 non-null int64 3 x1 78 non-null int64 4 y2 78 non-null int64 5 y1 78 non-null int64 6 x4 78 non-null int64 dtypes: int64(7) memory usage: 4.4 KB
.dtypes()
Return the dtypes in the DataFrame.
This returns a Series
with the data type of each column.
The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype.
df1.dtypes
Name object Weight int64 No int64 Pos object dtype: object
index
Indexes are Immutable sequence used for indexing and alignment. The basic object storing axis labels for all pandas objects.
The axis labeling information in pandas objects serves many purposes:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame as they have received more development attention in this area.
print(df1)
print()
df1 = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df1)
Name Weight No Pos 0 Wood 255 49 LS 1 Winston 231 2 QB 2 Williams 195 43 FS 3 White 185 14 WR 4 Lorenzen 285 13 QB Name Weight No Pos a Wood 255 49 LS b Winston 231 2 QB c Williams 195 43 FS d White 185 14 WR e Lorenzen 285 13 QB
DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
Reset the index, or a level of it.
Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
MultiIndex- Will keep put of scope for now. If you are interested you can look here
df1.reset_index() # this won't make any change in df1
print(df1)
Name Weight No Pos a Wood 255 49 LS b Winston 231 2 QB c Williams 195 43 FS d White 185 14 WR e Lorenzen 285 13 QB
df1.reset_index(drop = True,inplace=True)
print(df1)
Name Weight No Pos 0 Wood 255 49 LS 1 Winston 231 2 QB 2 Williams 195 43 FS 3 White 185 14 WR 4 Lorenzen 285 13 QB
.drop
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
df3= df1.copy(deep=True)
df3
Name | Weight | No | Pos | |
---|---|---|---|---|
a | Wood | 255 | 49 | LS |
b | Winston | 231 | 2 | QB |
c | Williams | 195 | 43 | FS |
d | White | 185 | 14 | WR |
e | Lorenzen | 285 | 13 | QB |
Dropping one row
df3.drop('c', axis=0)
Name | Weight | No | Pos | |
---|---|---|---|---|
a | Wood | 255 | 49 | LS |
b | Winston | 231 | 2 | QB |
d | White | 185 | 14 | WR |
e | Lorenzen | 285 | 13 | QB |
Dropping more than one row
df3.drop(['a','d'])
Name | Weight | No | Pos | |
---|---|---|---|---|
b | Winston | 231 | 2 | QB |
c | Williams | 195 | 43 | FS |
e | Lorenzen | 285 | 13 | QB |
df3.drop('Name',axis = 1) # axis = 1 is for columns
Weight | No | Pos | |
---|---|---|---|
a | 255 | 49 | LS |
b | 231 | 2 | QB |
c | 195 | 43 | FS |
d | 185 | 14 | WR |
e | 285 | 13 | QB |
Above code can also be written as
df3.drop('Name',axis = 'columns')
Weight | No | Pos | |
---|---|---|---|
a | 255 | 49 | LS |
b | 231 | 2 | QB |
c | 195 | 43 | FS |
d | 185 | 14 | WR |
e | 285 | 13 | QB |
Dropping More than one columns
df3.drop(['Name','No'],axis = 1)
Weight | Pos | |
---|---|---|
a | 255 | LS |
b | 231 | QB |
c | 195 | FS |
d | 185 | WR |
e | 285 | QB |
drop has a default setting as inplace= False
. This basically means that it drops the columns/ index when creating a output with the command but leaves the object in the memory without any change.
This can be manipulated by toggling to inplace= True
df3.drop('Name',axis = 'columns',inplace=True)
print(df3)
Weight No Pos a 255 49 LS b 231 2 QB c 195 43 FS d 185 14 WR e 285 13 QB
df3.columns
Index(['Weight', 'No', 'Pos'], dtype='object')
rename
DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
Alter axes labels.
Function / dict values must be unique (1-to-1).
Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
df1
Name | Weight | No | Pos | |
---|---|---|---|---|
a | Wood | 255 | 49 | LS |
b | Winston | 231 | 2 | QB |
c | Williams | 195 | 43 | FS |
d | White | 185 | 14 | WR |
e | Lorenzen | 285 | 13 | QB |
df1.columns
Index(['Name', 'Weight', 'No', 'Pos'], dtype='object')
df3 = df3.rename(columns={'No':'Number', 'Pos': 'Position'})
print(df3)
Weight Number Position a 255 49 LS b 231 2 QB c 195 43 FS d 185 14 WR e 285 13 QB
.sort_values()
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
Sort by the values along either axis 0 or 1
df3
Weight | Number | Position | |
---|---|---|---|
a | 255 | 49 | LS |
b | 231 | 2 | QB |
c | 195 | 43 | FS |
d | 185 | 14 | WR |
e | 285 | 13 | QB |
df3.sort_values(by='Weight', ascending=False).head(5)
Weight | Number | Position | |
---|---|---|---|
e | 285 | 13 | QB |
a | 255 | 49 | LS |
b | 231 | 2 | QB |
c | 195 | 43 | FS |
d | 185 | 14 | WR |
df3.sort_values(by=['Position', 'Weight'])
Weight | Number | Position | |
---|---|---|---|
c | 195 | 43 | FS |
a | 255 | 49 | LS |
b | 231 | 2 | QB |
e | 285 | 13 | QB |
d | 185 | 14 | WR |
df3.sort_values(by=['Position', 'Weight'],ascending= False)
Weight | Number | Position | |
---|---|---|---|
d | 185 | 14 | WR |
e | 285 | 13 | QB |
b | 231 | 2 | QB |
a | 255 | 49 | LS |
c | 195 | 43 | FS |
Querying requires the following to happen:
.loc
and .iloc
The main distinction between the two methods is:
.loc()
gets rows (and/or columns) with particular labels.
.iloc()
gets rows (and/or columns) at integer locations.
df2
idx | x3 | x2 | x1 | y2 | y1 | x4 | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | 4 | 3 | 1 | 1 | 4 |
1 | 2 | 0 | 2 | 3 | 0 | 1 | 3 |
2 | 3 | 0 | 1 | 3 | 1 | 0 | 2 |
3 | 4 | 1 | 1 | 3 | 1 | 1 | 2 |
4 | 5 | 0 | 1 | 3 | 1 | 1 | 2 |
... | ... | ... | ... | ... | ... | ... | ... |
73 | 74 | 0 | 2 | 1 | 0 | 0 | 1 |
74 | 75 | 0 | 1 | 3 | 0 | 1 | 2 |
75 | 76 | 0 | 4 | 3 | 1 | 0 | 2 |
76 | 77 | 0 | 1 | 2 | 0 | 0 | 1 |
77 | 78 | 1 | 3 | 1 | 1 | 1 | 2 |
78 rows × 7 columns
df2.loc[4]
idx 5 x3 0 x2 1 x1 3 y2 1 y1 1 x4 2 Name: 4, dtype: int64
df2.loc[2:10:2,['idx', 'y2', 'x4']]
idx | y2 | x4 | |
---|---|---|---|
2 | 3 | 1 | 2 |
4 | 5 | 1 | 2 |
6 | 7 | 0 | 2 |
8 | 9 | 0 | 1 |
10 | 11 | 0 | 1 |
df2.loc[4,['x4', 'y1']]
x4 2 y1 1 Name: 4, dtype: int64
df2.iloc[0:2,0:6:2]
idx | x2 | y2 | |
---|---|---|---|
0 | 1 | 4 | 1 |
1 | 2 | 2 | 0 |
df2.iloc[4, [6,5]]
x4 2 y1 1 Name: 4, dtype: int64
df2.iloc[4, 3:6]
x1 3 y2 1 y1 1 Name: 4, dtype: int64
df2.iloc[4, 3:6:2]
x1 3 y1 1 Name: 4, dtype: int64
df1
Name | Weight | No | Pos | |
---|---|---|---|---|
a | Wood | 255 | 49 | LS |
b | Winston | 231 | 2 | QB |
c | Williams | 195 | 43 | FS |
d | White | 185 | 14 | WR |
e | Lorenzen | 285 | 13 | QB |
df1.loc[df1.Name == 'Williams']
Name | Weight | No | Pos | |
---|---|---|---|---|
2 | Williams | 195 | 43 | FS |
df1.loc[(df1.Pos.isin(['LS', 'QB'])) & (df1.Weight> 215)]
Name | Weight | No | Pos | |
---|---|---|---|---|
0 | Wood | 255 | 49 | LS |
1 | Winston | 231 | 2 | QB |
4 | Lorenzen | 285 | 13 | QB |
df2.loc[0]
idx 1 x3 1 x2 4 x1 3 y2 1 y1 1 x4 4 Name: 0, dtype: int64
df2.loc[(df2.y1==1) & (df2.x2 >= 2),'x4']
0 4 1 3 6 2 9 2 12 2 13 4 14 3 16 2 25 2 29 3 34 2 35 1 39 2 40 2 43 2 56 2 57 2 58 2 59 3 62 3 64 2 66 3 69 2 70 3 72 2 77 2 Name: x4, dtype: int64
pd.read_excel('water readings.xls').plot('Read Date', 'Gallons Used')
<matplotlib.axes._subplots.AxesSubplot at 0x7fed6c18fa10>